Discriminating between Similar Languages Using a Combination of Typed and Untyped Character N-grams and Words

نویسندگان

Helena Gómez-Adorno

Ilia Markov

Jorge Baptista

Grigori Sidorov

David Pinto

چکیده

This paper presents the CIC UALG’s system that took part in the Discriminating between Similar Languages (DSL) shared task, held at the VarDial 2017 Workshop. This year’s task aims at identifying 14 languages across 6 language groups using a corpus of excerpts of journalistic texts. Two classification approaches were compared: a single-step (all languages) approach and a two-step (language group and then languages within the group) approach. Features exploited include lexical features (unigrams of words) and character n-grams. Besides traditional (untyped) character n-grams, we introduce typed character n-grams in the DSL task. Experiments were carried out with different feature representation methods (binary and raw term frequency), frequency threshold values, and machine-learning algorithms – Support Vector Machines (SVM) and Multinomial Naive Bayes (MNB). Our best run in the DSL task achieved 91.46% accuracy.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Comparison of Character n-grams and Lexical Features on Author, Gender, and Language Variety Identification on the Same Spanish News Corpus

We compare the performance of character n-gram features (n = 3–8) and lexical features (unigrams and bigrams of words), as well as their combinations, on the tasks of authorship attribution, author profiling, and discriminating between similar languages. We developed a single multi-labeled corpus for the three aforementioned tasks, composed of news articles in different varieties of Spanish. We...

متن کامل

ASIREM Participation at the Discriminating Similar Languages Shared Task 2016

This paper presents the system built by ASIREM team for the Discriminating between Similar Languages (DSL) Shared task 2016. It describes the system which uses character-based and word-based n-grams separately. ASIREM participated in both sub-tasks (sub-task 1 and subtask 2) and in both open and closed tracks. For the sub-task 1 which deals with Discriminating between similar languages and nati...

متن کامل

A Simple Baseline for Discriminating Similar Languages

This paper describes an approach to discriminating similar languages using wordand characterbased features, submitted as the Queen Mary University of London entry to the Discriminating Similar Languages shared task. Our motivation was to investigate how well a simple, datadriven, linguistically naive method could perform, in order to provide a baseline by which more linguistically complex or kn...

متن کامل

Author Clustering using Hierarchical Clustering Analysis

This paper presents our approach to the Author Clustering task at PAN 2017. We performed a hierarchical clustering analysis of different document features: typed and untyped character n-grams, and word n-grams. We experimented with two feature representation methods, log-entropy model, and tf-idf; while tuning minimum frequency threshold values to reduce the dimensionality. Our system was ranke...

متن کامل

Language- and Subtask-Dependent Feature Selection and Classifier Parameter Tuning for Author Profiling

We present the CIC’s approach to the Author Profiling (AP) task at PAN 2017. This year task consists of two subtasks: gender and language variety identification in English, Spanish, Portuguese, and Arabic. We use typed and untyped character n-grams, word n-grams, and non-textual features (domain names). We experimented with various feature representations (binary, raw frequency, normalized freq...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2017

Discriminating between Similar Languages Using a Combination of Typed and Untyped Character N-grams and Words

نویسندگان

چکیده

منابع مشابه

Comparison of Character n-grams and Lexical Features on Author, Gender, and Language Variety Identification on the Same Spanish News Corpus

ASIREM Participation at the Discriminating Similar Languages Shared Task 2016

A Simple Baseline for Discriminating Similar Languages

Author Clustering using Hierarchical Clustering Analysis

Language- and Subtask-Dependent Feature Selection and Classifier Parameter Tuning for Author Profiling

عنوان ژورنال:

اشتراک گذاری